$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment

Abstract

Existing benchmarks for conversational AI agents simulate single-control environments, where only the AI agent can use tools to interact with the world, while the user remains a passive information provider. This differs from real-world scenarios like technical support, where users need to actively participate in modifying the state of the (shared) world. In order to address this gap, we introduce $\tau^2$-bench, with four key contributions: 1) A novel Telecom dual-control domain modeled as a Dec-POMDP, where both agent and user make use of tools to act in a shared, dynamic environment that tests both agent coordination and communication, 2) A compositional task generator that programmatically creates diverse, verifiable tasks from atomic components, ensuring domain coverage and controlled complexity, 3) A reliable user simulator tightly coupled with the environment, whose behavior is constrained by tools and observable states, improving simulation fidelity, 4) Fine-grained analysis of agent performance through multiple ablations, including separating errors arising from reasoning vs. communication/coordination. In particular, our experiments show significant performance drops when agents shift from no-user to dual-control, highlighting the challenges of guiding users. Overall, $\tau^2$-bench provides a controlled testbed for agents that must both reason effectively and guide user actions.
